MDC Repository

An empirical analysis of training protocols for probabilistic gene finders

Author: Majoros William H
Salzberg Steven L
Publication venue: BioMed Central
Publication date: 01/12/2004

BACKGROUND: Generalized hidden Markov models (GHMMs) appear to be approaching acceptance as a de facto standard for state-of-the-art ab initio gene finding, as evidenced by the recent proliferation of GHMM implementations. While prevailing methods for modeling and parsing genes using GHMMs have been described in the literature, little attention has been paid as of yet to their proper training. The few hints available in the literature together with anecdotal observations suggest that most practitioners perform maximum likelihood parameter estimation only at the local submodel level, and then attend to the optimization of global parameter structure using some form of ad hoc manual tuning of individual parameters.
RESULTS: We decided to investigate the utility of applying a more systematic optimization approach to the tuning of global parameter structure by implementing a global discriminative training procedure for our GHMM-based gene finder. Our results show that significant improvement in prediction accuracy can be achieved by this method.
CONCLUSIONS: We conclude that training of GHMM-based gene finders is best performed using some form of discriminative training rather than simple maximum likelihood estimation at the submodel level, and that generalized gradient ascent methods are suitable for this task. We also conclude that partitioning of training data for the twin purposes of maximum likelihood initialization and gradient ascent optimization appears to be unnecessary, but that strict segregation of test data must be enforced during final gene finder evaluation to avoid artificially inflated accuracy measurements.

Springer

JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions

Author: Allen Jonathan E
Majoros William H
Pertea Mihaela
Salzberg Steven L
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Predicting complete protein-coding genes in human DNA remains a significant challenge. Though a number of promising approaches have been investigated, an ideal suite of tools has yet to emerge that can provide near perfect levels of sensitivity and specificity at the level of whole genes. As an incremental step in this direction, it is hoped that controlled gene finding experiments in the ENCODE regions will provide a more accurate view of the relative benefits of different strategies for modeling and predicting gene structures.
RESULTS: Here we describe our general-purpose eukaryotic gene finding pipeline and its major components, as well as the methodological adaptations that we found necessary in accommodating human DNA in our pipeline, noting that a similar level of effort may be necessary by ourselves and others with similar pipelines whenever a new class of genomes is presented to the community for analysis. We also describe a number of controlled experiments involving the differential inclusion of various types of evidence and feature states into our models and the resulting impact these variations have had on predictive accuracy.
CONCLUSION: While in the case of the non-comparative gene finders we found that adding model states to represent specific biological features did little to enhance predictive accuracy, for our evidence-based 'combiner' program the incorporation of additional evidence tracks tended to produce significant gains in accuracy for most evidence types, suggesting that improved modeling efforts at the hidden Markov model level are of relatively little value. We relate these findings to our current plans for future research.

Efficient decoding algorithms for generalized hidden Markov model gene finders

Author: Delcher Arthur L
Majoros William H
Pertea Mihaela
Salzberg Steven L
Publication venue: BioMed Central
Publication date: 01/01/2005

BACKGROUND: The Generalized Hidden Markov Model (GHMM) has proven a useful framework for the task of computational gene prediction in eukaryotic genomes, due to its flexibility and probabilistic underpinnings. As the focus of the gene finding community shifts toward the use of homology information to improve prediction accuracy, extensions to the basic GHMM model are being explored as possible ways to integrate this homology information into the prediction process. Particularly prominent among these extensions are those techniques which call for the simultaneous prediction of genes in two or more genomes at once, thereby increasing significantly the computational cost of prediction and highlighting the importance of speed and memory efficiency in the implementation of the underlying GHMM algorithms. Unfortunately, the task of implementing an efficient GHMM-based gene finder is already a nontrivial one, and it can be expected that this task will only grow more onerous as our models increase in complexity.
RESULTS: As a first step toward addressing the implementation challenges of these next-generation systems, we describe in detail two software architectures for GHMM-based gene finders, one comprising the common array-based approach, and the other a highly optimized algorithm which requires significantly less memory while achieving virtually identical speed. We then show how both of these architectures can be accelerated by a factor of two by optimizing their content sensors. We finish with a brief illustration of the impact these optimizations have had on the feasibility of our new homology-based gene finder, TWAIN.
CONCLUSIONS: In describing a number of optimizations for GHMM-based gene finders and making available two complete open-source software systems embodying these methods, it is our hope that others will be more enabled to explore promising extensions to the GHMM framework, thereby improving the state-of-the-art in gene prediction techniques.

Motif composition, conservation and condition-specificity of single and alternative transcription start sites in the Drosophila genome

Author: Majoros William H
Ohler Uwe
Rach Elizabeth A
Tomancak Pavel
Yuan Hsiang-Yu
Publication venue: BioMed Central
Publication date: 01/01/2009

A map of transcription start sites across the Drosophila genome, providing insights into initiation patterns and spatiotemporal conditions.

CiteSeerX

Crossref

Columbia University Academic Commons

MDC Repository

MPG.PuRe

Orion: Detecting regions of the human non-coding genome that are intolerant to variation using population genetics

Author: Allen Andrew S.
Copeland Brett
Dhindsa Ryan
Goldstein David B.
Gussow Ayal B.
Majoros William H.
Petrovski Slave
Wang Quanli
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/2017

There is broad agreement that genetic mutations occurring outside of the protein-coding regions play a key role in human disease. Despite this consensus, we are not yet capable of discerning which portions of non-coding sequence are important in the context of human disease. Here, we present Orion, an approach that detects regions of the non-coding genome that are depleted of variation, suggesting that the regions are intolerant of mutations and subject to purifying selection in the human lineage. We show that Orion is highly correlated with known intolerant regions as well as regions that harbor putatively pathogenic variation. This approach provides a mechanism to identify pathogenic variation in the human non-coding genome and will have immediate utility in the diagnostic interpretation of patient genomes and in large case control studies using whole-genome sequences.

University of Melbourne Institutional Repository

FigShare

Modeling the Evolution of Regulatory Elements by Simultaneous Detection and Alignment with Phylogenetic Pair HMMs

Author: A Loytynoja
A Siepel
A Viterbi
AL Halpern
AM Moses
AP Boyle
B Langmead
D Stanojevic
DA Pollard
DL Gumucio
DS Hirschberg
G Wray
GP Wagner
I Holmes
J Felsenstein
J Hawkins
JC Bryne
JD Thompson
JL Thorne
K Wong
MS Halfon
MZ Ludwig
N Saitou
PR Ray
R Durbin
R Satija
R Siddharthan
RC Edgar
RK Bradley
RW Lusk
Uwe Ohler
W Huang
WH Majoros
William H. Majoros
WJ Kent
WJL Quesne
Wyeth W. Wasserman
X He
Publication venue: Public Library of Science
Publication date: 01/01/2010

The computational detection of regulatory elements in DNA is a difficult but important problem impacting our progress in understanding the complex nature of eukaryotic gene regulation. Attempts to utilize cross-species conservation for this task have been hampered both by evolutionary changes of functional sites and poor performance of general-purpose alignment programs when applied to non-coding sequence. We describe a new and flexible framework for modeling binding site evolution in multiple related genomes, based on phylogenetic pair hidden Markov models which explicitly model the gain and loss of binding sites along a phylogeny. We demonstrate the value of this framework for both the alignment of regulatory regions and the inference of precise binding-site locations within those regions. As the underlying formalism is a stochastic, generative model, it can also be used to simulate the evolution of regulatory elements. Our implementation is scalable in terms of numbers of species and sequence lengths and can produce alignments and binding-site predictions with accuracy rivaling or exceeding current systems that specialize in only alignment or only binding-site prediction. We demonstrate the validity and power of various model components on extensive simulations of realistic sequence data and apply a specific model to study Drosophila enhancers in as many as ten related genomes and in the presence of gain and loss of binding sites. Different models and modeling assumptions can be easily specified, thus providing an invaluable tool for the exploration of biological hypotheses that can drive improvements in our understanding of the mechanisms and evolution of gene regulation.

CiteSeerX

Public Library of Science (PLOS)

Crossref

DukeSpace

MDC Repository

Macronuclear Genome Sequence of the Ciliate Tetrahymena thermophila, a Model Eukaryote

Author: Amedeo Paolo
Asai David J
Badger Jonathan H
Barbeau Rebecca A
Cai Hong
Carlton Jane M
Cherry J. Michael
Collins Kathleen
Coyne Robert S
del Toro Christina
Delcher Arthur L
Eisen Jonathan A
Elde Nels C
Farzad Maryam
Frankel Joseph
Gaertig Jacek
Garg Jyoti
Gorovsky Martin A
Haas Brian J
Hamilton Eileen P
Jones Kristie M
Karrer Kathleen M
Keeling Patrick J
Krieger Cynthia J
Lee Suzanne R
Majoros William H
Manning Gerard
Orias Eduardo
Patron Nicola J
Pearlman Ronald E
Ren Qinghu
Ruzzo Walter L
Ryder Hilary F
Salzberg Steven L
Silva Joana C
Smith Roger K
Stewart B. Andrew
Stover Nicholas A
Sun Lei
Tallon Luke J
Thiagarajan Mathangi
Tsao Che-Chia
Turkewitz Aaron P
Waller Ross F
Wang Yufeng
Weinberg Zasha
Wilamowska Katarzyna
Wilkes David E
Williamson Sondra C
Wloga Dorota
Wortman Jennifer R
Wu Dongying
Wu Martin
Publication venue: Public Library of Science
Publication date: 01/01/2006

The ciliate Tetrahymena thermophila is a model organism for molecular and cellular biology. Like other ciliates, this species has separate germline and soma functions that are embodied by distinct nuclei within a single cell. The germline-like micronucleus (MIC) has its genome held in reserve for sexual reproduction. The soma-like macronucleus (MAC), which possesses a genome processed from that of the MIC, is the center of gene expression and does not directly contribute DNA to sexual progeny. We report here the shotgun sequencing, assembly, and analysis of the MAC genome of T. thermophila, which is approximately 104 Mb in length and composed of approximately 225 chromosomes. Overall, the gene set is robust, with more than 27,000 predicted protein-coding genes, 15,000 of which have strong matches to genes in other organisms. The functional diversity encoded by these genes is substantial and reflects the complexity of processes required for a free-living, predatory, single-celled organism. This is highlighted by the abundance of lineage-specific duplications of genes with predicted roles in sensing and responding to environmental conditions (e.g., kinases), using diverse resources (e.g., proteases and transporters), and generating structural complexity (e.g., kinesins and dyneins). In contrast to the other lineages of alveolates (apicomplexans and dinoflagellates), no compelling evidence could be found for plastid-derived genes in the genome. UGA, the only T. thermophila stop codon, is used in some genes to encode selenocysteine, thus making this organism the first known with the potential to translate all 64 codons in nuclear genes into amino acids. We present genomic evidence supporting the hypothesis that the excision of DNA from the MIC to generate the MAC specifically targets foreign DNA as a form of genome self-defense. 
The combination of the genome sequence, the functional diversity encoded therein, and the presence of some pathways missing from other model organisms makes T. thermophila an ideal model for functional genomic studies to address biological, biomedical, and biotechnological questions of fundamental importance.

epublications@Marquette

eScholarship - University of California